Lexical access for large-vocabulary speech recognition
Abstract
In this paper, the lexical characteristics of two Chinese dialects and American English are explored. Different lexical representations are investigated, including tonal syllables, base syllables, phonemes, and broad phonetic classes. Multiple measurements are made, such as coverage, uniqueness, and cohort sizes. Our results are based on lexicons of 44K and 52K words in Chinese and English, obtained from the CallHome Corpus and the COMLEX Corpus, respectively. We have found that the set of the 4,000 most frequent words has a coverage of 92% and 77% for Chinese and English, respectively. The phonetic representation uniquely specifies 85%, 87%, and 93% of the lexicon for Mandarin, Cantonese, and English, respectively. While the three languages appear quite different when they are described by their full phoneme sets, their characteristics are more similar when they are represented in terms of broad phonetic classes.
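The measurements named in the abstract can be illustrated with a minimal sketch. It assumes that coverage means the fraction of running word tokens accounted for by the N most frequent words, that a cohort is the set of lexicon entries sharing one representation, and that uniqueness is the fraction of entries whose cohort has size one; the paper may define these differently. The toy lexicon, counts, and broad-class mapping are hypothetical and are not taken from CallHome or COMLEX.

```python
from collections import Counter, defaultdict


def coverage(token_counts, n):
    """Fraction of running tokens accounted for by the n most frequent words."""
    total = sum(token_counts.values())
    top = sum(count for _, count in token_counts.most_common(n))
    return top / total


def cohorts(lexicon, represent):
    """Group words whose pronunciations map to the same representation."""
    groups = defaultdict(list)
    for word, pron in lexicon.items():
        groups[represent(pron)].append(word)
    return groups


def uniqueness(groups):
    """Fraction of words whose representation is shared by no other word."""
    total = sum(len(words) for words in groups.values())
    unique = sum(1 for words in groups.values() if len(words) == 1)
    return unique / total


# Hypothetical broad-class mapping, for illustration only.
BROAD_CLASS = {
    "p": "STOP", "b": "STOP", "t": "STOP",
    "ae": "VOWEL", "ih": "VOWEL",
    "n": "NASAL",
}


def phonemes(pron):
    """Full phonemic representation: the phoneme sequence itself."""
    return tuple(pron)


def broad_classes(pron):
    """Coarser representation: each phoneme replaced by its broad class."""
    return tuple(BROAD_CLASS[p] for p in pron)


# Hypothetical toy lexicon and token counts.
TOY_LEXICON = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "pit": ["p", "ih", "t"],
    "pan": ["p", "ae", "n"],
}
TOY_COUNTS = Counter({"pat": 5, "bat": 3, "pit": 1, "pan": 1})

if __name__ == "__main__":
    print("coverage of top 2 words:", coverage(TOY_COUNTS, 2))
    for name, fn in [("phonemes", phonemes), ("broad classes", broad_classes)]:
        groups = cohorts(TOY_LEXICON, fn)
        largest = max(len(words) for words in groups.values())
        print(name, "uniqueness:", uniqueness(groups), "largest cohort:", largest)
```

In this toy data every word is unique under the full phoneme representation, while the broad-class representation collapses "pat", "bat", and "pit" into one cohort, which is the kind of trade-off the abstract's measurements are meant to quantify.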
Similar resources
Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise
Vocabulary size has been suggested as a useful measure of "verbal abilities" that correlates with speech recognition scores. Knowing more words is linked to better speech recognition. How vocabulary knowledge translates to general speech recognition mechanisms, how these mechanisms relate to offline speech recognition scores, and how they may be modulated by acoustical distortion or age, is les...
Modeling Lexical Tones for Mandarin Large Vocabulary Continuous Speech Recognition
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...
A New Decoder Design For Large Vocabula
An important problem in large vocabulary speech recognition for agglutinative languages like Turkish is the high out-of-vocabulary (OOV) rate caused by the extensive number of distinct words. Recognition systems using words as the basic lexical elements have difficulty in dealing with such a virtually unlimited vocabulary. We propose a new time-synchronous lexical tree decoder design using morphemes ...
Recognition of out-of-vocabulary words with sub-lexical language models
A major source of recognition errors, out-of-vocabulary (OOV) words are also semantically important; recognizing them is, therefore, crucial for understanding. Success, so far, has been modest, even on very constrained tasks. In this paper we present a new approach to unlimited vocabulary speech recognition based on using grapheme-to-phoneme correspondences for sub-lexical modeling of OOV words,...
The influence of lexical-access ability and vocabulary knowledge on measures of speech recognition in noise.
OBJECTIVE: The main objective was to investigate the effect of linguistic abilities (lexical-access ability and vocabulary size) on different measures of speech-in-noise recognition in normal-hearing listeners with various levels of language proficiency. DESIGN: Speech reception thresholds (SRTs) were measured for sentences in steady-state (SRTstat) and fluctuating noise (SRTfluc), and for digi...